Machine learning the Hohenberg-Kohn map for molecular excited states

The Hohenberg-Kohn theorem of density-functional theory establishes the existence of a bijection between the ground-state electron density and the external potential of a many-body system. This guarantees a one-to-one map from the electron density to all observables of interest including electronic excited-state energies. Time-Dependent Density-Functional Theory (TDDFT) provides one framework to resolve this map; however, the approximations inherent in practical TDDFT calculations, together with their computational expense, motivate finding a cheaper, more direct map for electronic excitations. Here, we show that determining density and energy functionals via machine learning allows the equations of TDDFT to be bypassed. The framework we introduce is used to perform the first excited-state molecular dynamics simulations with a machine-learned functional on malonaldehyde and correctly capture the kinetics of its excited-state intramolecular proton transfer, allowing insight into how mechanical constraints can be used to control the proton transfer reaction in this molecule. This development opens the door to using machine-learned functionals for highly efficient excited-state dynamics simulations.


Supplementary Method 1. Kernel Ridge Regression
As described in the main text, we use Kernel Ridge Regression (KRR) to machine learn excited-state density functionals. The key equations of KRR are briefly reviewed here in terms of abstract training points (x i , y i ), where x 1 , ...x M ∈ R d are the features and Y = (y 1 , ...y M ) T ∈ R M their respective labels. We seek a function that maps from the features to the labels f : R d → R, and this is expressed as a linear expansion of the kernel function, κ: where α i are the model coefficients to be found, and the kernel has a Gaussian form: Here σ controls the width of the Gaussian kernel, and is optimized as a hyperparameter during the model training. The model is trained by minimizing the following loss function: where K ij = κ (x i , x j ) is the kernel matrix, ∥f ∥ H is the reproducing kernel Hilbert space norm, and λ is a regularization parameter which, like σ, is optimized as a hyperparameter. The solution of minimizing the loss function (Supplementary Eq. 3) is In the context of the ML-KS map, the training features are discretized Gaussians potentials and the respective labels are their associated excited-state energies. To learn the electron density representations, the training features are the discretized Gaussians potentials and the respective labels are the set of Fourier coefficients for each energy state. In the context of the energy functionals, the training features are the expansion coefficients of the electronic density in a Fourier basis representation and the labels are the associated excited-state energies.

Supplementary Method 2. ML multistate Hohenberg-Kohn map
As described in Section IV A, we use a basis representation of the electronic densities for a particular electronic state, given by where ϕ l are the Fourier basis functions of which there are L. To train the model, first the basis set coefficients, {u (l) }, are found by a Fourier transform of the density from a regular real-space grid representation: where {r m } are the real-space grid points on which the density is evaluated, k l is the wavevector of Fourier function l, and B are the real-space grid box lengths. In our implementation, we instead perform sine and cosine transformations (equivalent to a Fourier transform) so that the frequency domain density coefficients are purely real numbers. If L x is the number of sine and cosine functions of the x coordinate, then the total number of coefficients is L = L x × L y × L z . For a particular electronic state, the loss function for machine-learned densities follows similarly to the regular KRR model: By writing the ML model with basis function coefficients and a kernel expansion as in section IV A, , the loss function becomes: Here, κ[v i , v k ] is a kernel functional given by Supplementary Eq. 2 with ∥·∥ representing the function norm of the Using the orthogonality of the basis functions yields the solution of which is analogous to that of regular KRR: where λ is a global regularization hyperparameter and K σ is the Gaussian kernel with a global kernel width hyperparameter, σ. These hyperparameters were optimized in a grid search (following Ref. 1) with the standard 5-fold cross-validation procedure. [2,3] Supplementary Method 3. Clustering the training set For a molecule with several degrees of freedom, like malonaldehyde, the accuracy of a machine-learned density functional will depend on both the size of the training set and how well the training set samples the important conformational space of the molecule of interest. To generate an efficient training set for planar malonaldehyde, we follow previous work [1] and employ K-means clustering to identify the important and diverse samples that we then include in the training set to cover the relevant conformational space as much as possible with a limited number of conformers.
Assuming we have conformers labelled i = 1...N with parameters p i , we seek to find M clusters,P j , j = 1...M , that minimize the following objective function: where i ∈P j if and only ifp j is the cluster center closest to p i . In principle, p i can be any properties of the conformer. In this work we used the Gaussian representation of the external potential v(r), since this property enters the Gaussian kernel of Supplementary Eq. 9, and exhibits desirable features, such as permutational invariance. The K-means algorithm provides a solution to Supplementary Eq. 11, [4] and we select the M samples closest to the cluster centers.

Supplementary Note 1. Convergence of the model with sample size
To generate a learning curve that reflects the benefits of adding training data from an increasing number of nonequilibrium excited-state trajectories, we used an iterative approach to expand our training set. Starting with the 1000 representative ground state samples which were clustered and aligned from the 2000 ground state samples as described in Section IV D, 480 aligned geometries from a single S 2 trajectory were added to the training set pool. All 1480 geometries were clustered with the K-means algorithm to identify 1050 trajectory snapshots closest to the cluster centers, i.e. the training set size was expanded by 50 samples each iteration. This procedure was iterated 30 times, eventually yielding 2500 samples before symmetrization, and the 31 training sets generated from this procedure were used in Fig. 1 to get the mean absolute error on the test set of S 2 excited state geometries. In this way, information from one new trajectory was added to the model at a time. The learning curve showed that when we had added 30 excited state trajectories, the out-of-sample error was already below 0.2 kcal mol −1 . In the final training set, 817 of the original unsymmeterized ground state snapshots survived, with the remainder of the samples comings from the added excited-state trajectories.

Supplementary Note 2. Two additional frameworks for learning excited-state functionals
In addition to the ML-MSHK framework using multiple states in the training described in the main text, we also tested two different frameworks for predicting excited-state energies. Both approaches can only be used to predict the S 2 excited state energies. We call them ML-  The potential energy profile in Fig. 5 was generated with the CIOpt software package. [5] The geometries corresponding to each point along the curves in this figure were optimized in internal coordinates for a series of fixed proton-transfer coordinates (r − ) and oxygen-oxygen distances (d OO ), subject to a constraint of planarity. The electronic structure of the S 2 state was found using TD-PBE0 [6,7] with the aug-cc-pvdz basis set [8] as implemented in TeraChem [9,10]. With the minimal energy pathways found, TD-PBE0 energies were evaluated along the pathways using CPMD [11], in order to make a direct connection to our ML-MSHK model, which was trained on CPMD energies.
Supplementary Note 4. C 2v geometry for alignment As discussed in section IV D, each sample is aligned to a C 2v geometry representing an idealized planar protontransfer transition state of malonaldehyde. The detailed coordinates of this C 2v geometry are shown here.

Supplementary Note 5. Learning excited-state functionals for non-planar Malonaldehyde
In the main text, we considered planar malonaldehyde, which maintains a large S 2 -S 1 energy gap for the duration of the excited-state proton transfer reaction. In this section, we explore the ability of machine-learned excited-state functionals to predict densities and energies relevant to the S 2 excited-state dynamics of MA in the absence of a planar restraint. Previous theoretical studies predict out-of-plane motions on the S 2 state that lead to electronic crossings with lower states. [12] This molecule therefore serves as a useful test of the ability of machine-learned excited-state functionals to describe electronic near-degeneracies. We consider two models: the ML-MSHK model mentioned in the main text and the ML-ESHK model described in Supplementary Note 2. Here the ML-ESHK model uses separate maps to predict S 1 and S 2 densities and energies, i.e. ML-ESHK[n 1 ] and ML-ESHK[n 2 ], while the ML-MSHK model simultaneously predicts S 1 and S 2 energies. Because the S 0 /S 1 energy gap is larger than 88 kcal/mol along all the ab initio trajectories with TD-PBE0 electronic structure, i.e. S 0 /S 1 and S 0 /S 2 crossings are not observed, we consider only S 1 and S 2 predictions. Not including S 0 densities reduces the storage requirements of the training set.
To generate training and test sets for non-planar MA, we performed 100 additional independent AIMD trajectories on the S 2 state (i.e. adiabatic dynamics) without any restraint of planarity. Initial conditions were taken following vertical excitations spaced every 100 fs from the same AIMD ground-state trajectory used in the main text. Dynamics was propagated for 60 fs using CPMD. [11] Otherwise, all other aspects of the simulations were identical to those described in Section IV C of the main text.
Next, a series of increasingly larger grand training sets were generated by augmenting 5,000 samples from the planar MA training set described in the main text with structures taken from every timestep of the first 20 to 90 non-planar excited-state AIMD trajectories. Each trajectory contributed 240 structures, so we considered grand training sets of between 9,800 and 26,600 samples. 2,400 samples from the remaining ten excited-state AIMD trajectories were used as a test set.
To reduce the size of the training set, we used the dataset reduction protocol of Smith et al. [13] Briefly, starting from 5% of structures randomly selected from the grand training set, an excited-state functional was trained. The energy predictions for the remaining training samples in the grand training set were then evaluated. 2% of samples that had absolute errors in energy (averaged over both S 1 and S 2 in the case of ML-MSHK) of greater than a threshold of 0.3 kcal/mol were randomly selected and added to the training set and the model retrained. Energy predictions for the remaining samples were re-evaluated and another 2% of samples that had energy errors over the threshold were randomly selected and added to the training set. The process was repeated iteratively until fewer than 5% of the remaining samples had errors above the threshold, at which point, the remaining samples with errors above the threshold were added to the training set.
To further reduce the storage requirements of the training set, compared to the grid used for planar MA, we pruned the number of real-space grid points on which the external potential was evaluated (for use in eq. 6) to 48 × 40 × 24 with a spacing of 0.25Å. We verified this grid was sufficiently large to span the entire molecule with a buffer of at least 2Å from any atom to the edge of the grid, even for non-planar geometries. We also confirmed that the increase in grid spacing did not affect the out-of-sample test error.
To improve the learning of densities and assign the appropriate energies, we found it necessary to incorporate a simple state-tracking algorithm when labelling the ab initio states as S 1 or S 2 . At each excited-state AIMD timestep, we compared the excited-state densities, expanded in a basis of 27,000 (30 × 30 × 30) Fourier functions, to the previous timestep's (t − δt) densities by computing the inner product, where l runs over the Fourier basis coefficients, and j and k run over the state indices (S 1 and S 2 in our case). If S 1,2 (t) + S 2,1 (t) > S 1,1 (t) + S 2,2 (t), the adiabatic state labels at timestep t were swapped compared to their ordering at the previous timestep. This tracking accounted for the possibility that the ordering of the electronic states switch as a trajectory traverses an electronic state crossing.
With the sample reduction and state-tracking protocols described above, we generated learning curves for the ML-MSHK and ML-ESHK models by varying the grand training set size from 20 to 90 trajectories, i.e. 9,800 to 26,600 total samples, and evaluating out-of-sample errors on the test set, with the results shown in Supplementary Figure  2. Considering first the ML-ESHK predictions (green curves), the sample reduction protocol is seen to significantly compress the training set; however, the out-of-sample error is much larger than the threshold error. Furthermore, the errors in ML-ESHK appear to saturate and do not improve below 1.5 kcal/mol. The ML-MSHK learning curve, on the other hand, shows that a larger number of samples survive the reduction procedure compared to ML-ESHK; however, the errors now decrease monotonically with increasing size of the grand training set. In particular, the S 1 and S 2 energy predictions attain chemical accuracy compared to the reference ab initio electronic structure (MAEs of 0.90 and 0.83 kcal/mol respectively) at the largest grand training set size (26,600 initial samples and 9,000 samples after reduction).
To confirm that ML-MSHK provides a consistently robust excited-state functional and to understand the lower errors in ML-MSHK compared to ML-ESHK, we plot the predicted excited-state energies along an AIMD trajectory from the test set near S 2 /S 1 electronic crossings between t = 30 fs to t = 60 fs, shown in Supplementary Figure 3. The largest training set following data reduction was used for each model.
ML-MSHK predictions (blue curves) are seen to almost perfectly reproduce the AIMD energies (red curves) for the entire trajectory. On the other hand, the ML-ESHK model (green curves) displays deviations between the predicted and AIMD energies that exceed 2 kcal/mol. In particular, the ML-MSHK prediction error is seen to be noticeably larger after first passing through a crossing region of the electronic states at t = 42 fs. To quantify this, we calculated the energy differences before and after the first state crossing, defined as when the S 1 /S 2 gap first drops below 10 kcal/mol. After averaging over the test set, we found that the mean absolute error of ML-MSHK is 0.2 kcal/mol before the first state crossing and 1.5 kcal/mol after it. For ML-ESHK, the errors are 0.5 kcal/mol before and 2.4 kcal/mol after. The larger errors for ML-ESHK can be understood as arising from the rapidly changing nature of the electronic density near state crossings, such that the ML-ESHK energy functional, which is trained only on a single state at a time, has insufficient training data to accurately predict the state energy near and following a state crossing. On the other hand, since the ML-MSHK functional is trained simultaneously on both electronic states, it is better able to capture the correct functional dependence on electronic densities in the crossing regions. This clearly demonstrates the benefit of the ML-MSHK model over ML-ESHK.